UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Nguyen, Le Thien Phuc, Yu, Zhuoran, Cao, Khoa Quang Nhat, Guo, Yuwei, Pham, Tu Ho Manh, Nguyen, Tuan Tai, Vo, Toan Ngo Duc, Poon, Lucas, Lee, Soochahn, Lee, Yong Jae
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers talking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
LASER: Lip Landmark Assisted Speaker Detection for Robustness
Nguyen, Le Thien Phuc, Yu, Zhuoran, Lee, Yong Jae
Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at \url{https://github.com/plnguyen2908/LASER_ASD}.
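The abstract above mentions an auxiliary consistency loss that aligns predictions from the lip-aware and face-only feature paths, but does not give its form. A minimal sketch of one plausible formulation, assuming a symmetric KL divergence between the two branches' per-frame speaking/not-speaking distributions (the function names and the choice of divergence are illustrative assumptions, not the paper's exact loss):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(lip_aware_logits, face_only_logits, eps=1e-8):
    """Symmetric KL divergence between the lip-aware branch's and the
    face-only branch's predicted distributions (illustrative sketch).

    Both inputs: (num_frames, 2) logits for [not-speaking, speaking].
    """
    p = softmax(lip_aware_logits)
    q = softmax(face_only_logits)
    kl_pq = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))
```

Minimizing such a term pushes the face-only path to mimic the lip-aware path, so the model can still behave sensibly when the landmark detector fails and only face features are available.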
Rethinking Audio-visual Synchronization for Active Speaker Detection
Wuerkaixi, Abudukelimu, Zhang, You, Duan, Zhiyao, Zhang, Changshui
Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification of definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail at modeling audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
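The cross-modal contrastive strategy described above is not specified in detail in the abstract. A minimal sketch of one common form of such a loss, assuming an InfoNCE-style objective where synchronized audio/visual embedding pairs are treated as positives and other pairings in the batch as negatives (the function name, embedding shapes, and temperature are assumptions for illustration):

```python
import numpy as np

def cross_modal_info_nce(audio_emb, visual_emb, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of embeddings.

    audio_emb, visual_emb: (batch, dim); row i of each is assumed to be
    a synchronized (positive) audio/visual pair, so matched pairs sit
    on the diagonal of the similarity matrix.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = a @ v.T / temperature
    # Cross-entropy with the diagonal (synchronized pair) as the target.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))
```

Driving this loss down pulls synchronized audio and lip-motion embeddings together and pushes unsynchronized pairings apart, which is the synchronization cue the abstract argues existing ASD models fail to capture.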
Chimpanzee face recognition from videos in the wild using deep learning
Evaluation was performed on a held-out test set using the standard protocol outlined by Everingham et al. (39). The precision/recall curve was computed from a method's ranked output. Recall was defined as the proportion of all positive examples ranked above a given rank, while precision was the proportion of all examples above that rank which are from the positive class. For the purpose of our task, high recall was more important than high precision (i.e., false positives are less dangerous than false negatives) to ensure no chimpanzee face detections were missed. Some false positives, such as the recognition of chimpanzee behinds as faces (e.g., fig.
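The rank-based precision/recall definitions above can be computed directly from a method's ranked output. A small sketch (the function name is ours; it follows the stated definitions, not any specific evaluation code from the paper):

```python
def precision_recall_curve(ranked_labels):
    """Precision/recall at each rank of a scored, sorted output list.

    ranked_labels: ground-truth labels (1 = positive, 0 = negative),
    ordered from highest- to lowest-scoring detection.
    Returns a list of (precision, recall) tuples, one per rank.
    """
    total_pos = sum(ranked_labels)
    curve, true_pos = [], 0
    for rank, label in enumerate(ranked_labels, start=1):
        true_pos += label
        # precision: fraction of examples above this rank that are positive;
        # recall: fraction of all positives ranked above this rank.
        curve.append((true_pos / rank, true_pos / total_pos))
    return curve
```

Because missed faces are costlier than false alarms in this task, one would read the curve at an operating point chosen for high recall rather than high precision.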
Face Clustering in Videos with Proportion Prior
Tang, Zhiqiang (Chinese Academy of Sciences) | Zhang, Yifan (Chinese Academy of Sciences) | Li, Zechao (Nanjing University of Science and Technology) | Lu, Hanqing (Chinese Academy of Sciences)
In this paper, we investigate the problem of face clustering in real-world videos. In many cases, the distribution of the face data is unbalanced. In movies or TV series, the leading cast appears quite often and the other characters appear much less. However, many clustering algorithms cannot handle such a severely unbalanced data distribution: large classes get split apart, while small classes are merged into large ones and thus go missing. On the other hand, the data distribution proportions may be known beforehand. For example, we can obtain such information by counting the spoken lines of the characters in the script text. Hence, we propose to use this proportion prior to regularize the clustering. A Hidden Conditional Random Field (HCRF) model is presented to incorporate the proportion prior. In experiments on a public data set of real-world videos, we observe improvements in clustering performance over state-of-the-art methods.